1 Introduction

This tutorial demonstrates a (semi-)automated method for downloading grey literature, using Technology Appraisal documents from the National Institute for Health and Care Excellence (NICE) as an example.

The same web-scraping approach can be used to download documents where patterns in PDF/HTML URLs can be identified across various types of grey literature (i.e., not limited to NICE documents). For exceptions that don’t follow regular patterns, URLs can be extracted from the page source of static web pages. While example code for this second approach is provided in the appendix, a more detailed tutorial may be included in a future update.

Once the files are downloaded, manual checks will still be required during the literature review process. However, this method will save significant time compared to downloading each document individually.

For simplicity, this tutorial uses the systematic download of 10 TAs as an example, but it can easily be applied to download hundreds of TAs at once.


2 Rationale

This tutorial is based on code originally developed for a systematic review conducted in 2020 (see Appendix). The motivation for creating this code was the lack of platforms capable of systematically downloading grey literature for review purposes. The code was initially designed to screen/review 460 NICE Technology Appraisals (TAs) related to health technology assessments (HTAs) that discuss treatment sequences.

Conference presentations on this topic are available below, and a journal publication is in progress.

Chang JYA, Latimer NR, Gillespie D, Chilcott J. Prevalence, Characteristics, and Key Issues of Modelling Treatment Sequences in Health Economic Evaluations. Virtual ISPOR Europe 2020, Nov 16-19, 2020. [Poster Presentation]

Chang JYA, Chilcott JB, Latimer NR. Exploring Data-Driven Challenges in Modelling the Effectiveness of Treatment Sequences in Health Economic Evaluations. The 15th IHEA World Congress on Health Economics, July 8-12, 2023. [Oral Presentation]

Full results of the systematic review can be found in Chapter 3 of my PhD thesis.

Chang JYA. Investigating the Application of Causal Inference Methods for Modelling the Impact of Treatment Sequences in Health Economic Evaluations. Doctoral dissertation, University of Sheffield, 2024.



Now, let’s begin with the method.



3 Locate the TA indices for review on the NICE website

First, navigate to the NICE website. Identify the indices of TAs (i.e., TA numbers) that you are interested in reviewing.

Here, I use Type 2 diabetes TAs as an example. There are 10 relevant appraisals, as shown in the screenshot below, including one that was terminated: TA1006 (terminated), TA924, TA877, TA583, TA572, TA288, TA418, TA390, TA336, TA315.



4 Create a vector of all TAs of interest

One TA (TA1006) was terminated, so a vector of the indices for the remaining 9 TAs of interest is created:

# Create a vector containing indices of TAs that you are interested in reviewing
TA_vector <- c(924, 877, 583, 572, 288, 418, 390, 336, 315)
# Sort the TA_vector (if it's not already sorted)
TA_vector <- sort(TA_vector, decreasing = TRUE)


Note: Alternatively, you can download the TA recommendation list from this NICE webpage and clean it in R or Excel (e.g., for title screening). You can then import the list into R. This approach is particularly useful if there are many entries (e.g., more than 20 TAs) (see example in Appendix). The NICE TA recommendation list includes removed and terminated TAs, while the active TA search list may not always show those that have been replaced or terminated.
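As a sketch of that alternative route, suppose the recommendation list has been cleaned down to a small table of TA references and titles (the column names and sample rows below are illustrative assumptions, not the actual NICE export format). Terminated appraisals can then be dropped and the numeric indices extracted:

```r
# Hypothetical cleaned extract of the NICE TA recommendation list
ta_list <- data.frame(
  reference = c("TA1006", "TA924", "TA877", "TA583"),
  title     = c("Example appraisal (terminated appraisal)",
                "Example appraisal A", "Example appraisal B", "Example appraisal C"),
  stringsAsFactors = FALSE
)

# Drop terminated appraisals, then convert "TAxxx" references to numeric indices
active    <- ta_list[!grepl("terminated", ta_list$title, ignore.case = TRUE), ]
TA_vector <- sort(as.numeric(sub("TA", "", active$reference)), decreasing = TRUE)
TA_vector
# [1] 924 877 583
```

The same filtering can of course be done in Excel before importing; the point is that the cleaned list feeds straight into the TA_vector used throughout this tutorial.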

5 Investigate URL Patterns for targeted PDF downloads

In this example, the aim is to download and review three key documents for each TA:

1. Company/Manufacturer Submission report (CS)
2. Evidence Review Group (ERG) Report / Assessment Group (AG) Report / External Assessment Group (EAG) Report
3. Final Appraisal Determination (FAD)


For illustration, let’s use “TA877: Finerenone for treating chronic kidney disease in type 2 diabetes” as an example. On the document history page of TA877, you can find all the documents related to this appraisal (see screenshot below).




In recent TAs (i.e., approximately after TA350), the initial CS report and ERG/AG/EAG report are typically included in the committee papers (CP) under the draft guidance or initial consultation section, while the final FAD is usually found in the final draft guidance section (see the red highlighted boxes in the screenshot above).

Note 1: Multiple versions of these documents may exist if the appraisal went through multiple stages (such as an appeal). In this tutorial, we will use slight variations in the URL to download both the initial and subsequent versions of the CS and ERG/EAG/AG documents within a TA.

Note 2: Handling earlier TAs, which may require different URL structures for CS and ERG/EAG/AG documents, is not covered here but is included in the Appendix code for reference.

5.1 URL for the initial CPs

The screenshot below shows the content list and URL (green highlighted box) for the PDF of the committee papers under the draft guidance of TA877, as previously mentioned. The red boxes highlight the locations of the CS report and the EAG report.




The URL pattern for the CPs of different TAs typically follows this structure:
https://www.nice.org.uk/guidance/ta877/documents/committee-papers

That is, to access the initial CP for other TAs, simply replace “877” in the URL with the relevant TA index.
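As a one-line sketch, this substitution can be wrapped in a small helper (cp_url_for is a hypothetical name for illustration, not a NICE or package function):

```r
# Build the committee-papers URL for any TA index
cp_url_for <- function(ta) {
  paste0("https://www.nice.org.uk/guidance/ta", ta, "/documents/committee-papers")
}

cp_url_for(924)
# [1] "https://www.nice.org.uk/guidance/ta924/documents/committee-papers"
```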


5.2 URL for the final FADs

The screenshot below shows the PDF of the FAD in the final draft guidance of TA877 and its URL (green highlighted box).




The URL pattern for the final FAD of different TAs typically follows this structure:
https://www.nice.org.uk/guidance/ta877/documents/final-appraisal-determination-document

That is, to access the FAD for other TAs, simply replace “877” with the relevant TA index.


6 Automated download of multiple PDFs from a URL list

Now that we have the URL pattern, we can automate the download of multiple PDFs at once. The steps are as follows:

1. Build a download function: We’ll create a function that automates the download of PDFs based on a list of URLs.
2. Prepare URL lists and bulk download the PDFs: Next, we’ll create separate lists of URLs for each document type (i.e., CPs, FADs). We’ll use the function to download PDFs from each URL list for different document types.
3. Track downloads: Create files to track which documents were successfully downloaded and which failed within each TA.


6.1 Build an automated download function

Below is the function we’ll use to automate this process:

# Create a function that downloads files and returns a data frame showing
# the download status for each bulk download
file_download <- function(TA_number, url, file_name) {
  error_vec <- rep("NA", times = length(TA_number))  # Initialise error status
  download_status <- data.frame(TA = TA_number,      # Create status tracking data frame
                                Status = as.character(error_vec))
  for (n in 1:length(TA_number)) {
    tryCatch({
      download.file(url[n], destfile = file_name[n], mode = "wb")  # Download file
      download_status$Status[n] <- "downloaded"  # Mark as downloaded if successful
    }, error = function(e) {
      # Silently handle the error for the purposes of this tutorial
      NULL  # See Appendix for how this can be customised
    })
  }
  download_status[download_status$Status != "downloaded", "Status"] <- "NA"
  return(download_status)  # Return download status data frame
}


This function automates the download process, making it easy to download multiple PDFs by simply providing a list of URLs. In the following sections, we’ll create lists of URLs for different document types and use this file_download function to download them efficiently.

6.2 Prepare URL lists and bulk download

6.2.1 CP (CS and ERG/EAG/AG documents)

For CPs (which include the CS and ERG/EAG/AG documents), here’s how to generate the URLs for all 9 TAs in TA_vector (from Section 4) and download them together using the file_download function:

# Based on the aforementioned URL patterns, generate URLs and file names for all 9 type 2 diabetes TAs using the TA_vector and download them together using the file_download function
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers", sep = "")
file_name <- paste("TA", TA_vector, "_CP.pdf", sep = "") # Save the files to the working directory with a "CP" suffix in the file name

# Use the file_download function to download all committee papers
Fulltext_CP <- file_download(TA_number = TA_vector, url = url, file_name = file_name)

# Check how many files were successfully downloaded
length(Fulltext_CP[Fulltext_CP$Status == "downloaded", "TA"])
## [1] 6
# Show the download status (printing is practical only for short lists)
Fulltext_CP
##    TA     Status
## 1 924 downloaded
## 2 877 downloaded
## 3 583 downloaded
## 4 572 downloaded
## 5 418 downloaded
## 6 390 downloaded
## 7 336         NA
## 8 315         NA
## 9 288         NA


The first bulk download attempt using the identified CP URL pattern successfully retrieved 6 out of 9 CP reports, as indicated by the “Status” in the download tracking list (i.e., Fulltext_CP). According to the tracking list, TA336, TA315, and TA288 could not be downloaded using the same URL pattern (i.e., displaying NA). These may need to be downloaded manually or by finding alternative URL patterns using the method outlined in the Appendix.

For now, we can attempt to download subsequent (i.e., non-initial) CP documents by using the same method to identify URLs, experimenting with different variations. In some cases, a higher suffix (e.g., -2) may actually correspond to the initial reports, which can be confirmed upon reviewing the file content. The goal is to download all available versions of the CP documents wherever possible.

# CPs whose URL ends in -2
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-2", sep = "")
file_name <- paste("TA", TA_vector, "_CP2.pdf", sep = "")
Fulltext_CP2 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_CP2[Fulltext_CP2$Status == "downloaded","TA"]) 
## [1] 3
Fulltext_CP$Status2 <- Fulltext_CP2$Status # Amend status in the main download tracking list
Fulltext_CP
##    TA     Status    Status2
## 1 924 downloaded         NA
## 2 877 downloaded downloaded
## 3 583 downloaded         NA
## 4 572 downloaded         NA
## 5 418 downloaded downloaded
## 6 390 downloaded downloaded
## 7 336         NA         NA
## 8 315         NA         NA
## 9 288         NA         NA


For CP documents whose URL ends in -2, three additional reports (from TA877, TA418, and TA390) were downloaded, as indicated by “Status2” in the CP download tracking list.

# CPs whose URL ends in -3
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-3", sep = "")
file_name <- paste("TA", TA_vector, "_CP3.pdf", sep = "")
Fulltext_CP3 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_CP3[Fulltext_CP3$Status == "downloaded","TA"])
## [1] 2
Fulltext_CP$Status3 <- Fulltext_CP3$Status # Amend status in the main download tracking list
Fulltext_CP
##    TA     Status    Status2    Status3
## 1 924 downloaded         NA downloaded
## 2 877 downloaded downloaded         NA
## 3 583 downloaded         NA         NA
## 4 572 downloaded         NA         NA
## 5 418 downloaded downloaded downloaded
## 6 390 downloaded downloaded         NA
## 7 336         NA         NA         NA
## 8 315         NA         NA         NA
## 9 288         NA         NA         NA


For CP documents whose URL ends in -3, two additional reports (from TA924 and TA418) were downloaded, as indicated by “Status3” in the CP download tracking list.

# CPs whose URL ends in -4
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-4", sep = "")
file_name <- paste("TA", TA_vector, "_CP4.pdf", sep = "")
Fulltext_CP4 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_CP4[Fulltext_CP4$Status == "downloaded","TA"])
## [1] 0
Fulltext_CP$Status4 <- Fulltext_CP4$Status # Amend status in the main download tracking list
Fulltext_CP
##    TA     Status    Status2    Status3 Status4
## 1 924 downloaded         NA downloaded      NA
## 2 877 downloaded downloaded         NA      NA
## 3 583 downloaded         NA         NA      NA
## 4 572 downloaded         NA         NA      NA
## 5 418 downloaded downloaded downloaded      NA
## 6 390 downloaded downloaded         NA      NA
## 7 336         NA         NA         NA      NA
## 8 315         NA         NA         NA      NA
## 9 288         NA         NA         NA      NA


From the CP download tracking list (i.e., Fulltext_CP), we can see that no additional downloads succeeded for the URL variation ending in -4, as indicated by “Status4”. This suggests that the highest valid URL suffix for this set of TAs is -3. In a real review you could keep testing higher suffixes (e.g., -5) until no new files appear; for this tutorial example, we stop here.
As for TA336, TA315, and TA288, no CP documents could be downloaded using these URL patterns. They can either be downloaded manually or retrieved with a more exhaustive web-scraping approach that extracts all available URLs from the page source of the document-history page (as shown in the Appendix). This tutorial may be updated in the future with a detailed explanation of that code. Note that documents with very different URL patterns generally come from TAs before approximately TA350, as earlier appraisals often uploaded CS documents separately from ERG/AG reports; the more complex approach is therefore less likely to be needed when reviewing only TAs numbered above 350.
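The repeated suffix blocks above can also be collapsed into a single loop. The sketch below assumes the file_download function from Section 6.1 is already defined and that TA_vector exists; download_cp_versions and the maximum suffix are illustrative choices, not fixed parts of the method:

```r
# Try the committee-papers URL with suffixes "", "-2", "-3", ... and collect
# every pass into one tracking data frame (columns Status, Status2, Status3, ...)
download_cp_versions <- function(TA_vector, max_suffix = 4) {
  labels   <- c("", as.character(2:max_suffix))  # "", "2", "3", "4"
  tracking <- data.frame(TA = TA_vector)
  for (i in seq_along(labels)) {
    suffix    <- ifelse(labels[i] == "", "", paste0("-", labels[i]))
    url       <- paste0("https://www.nice.org.uk/guidance/ta", TA_vector,
                        "/documents/committee-papers", suffix)
    file_name <- paste0("TA", TA_vector, "_CP", labels[i], ".pdf")
    tracking[[paste0("Status", labels[i])]] <-
      file_download(TA_number = TA_vector, url = url, file_name = file_name)$Status
  }
  tracking
}

# Fulltext_CP <- download_cp_versions(TA_vector, max_suffix = 4)
```

The commented-out call at the end would reproduce the Status/Status2/Status3/Status4 columns built step by step above, in one go.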

The screenshot below shows that all available CP documents were downloaded to the project folder within 1-2 minutes.



6.2.2 FAD

Similar to the CP documents, here’s how to create the FAD URLs for all 9 TAs and download them using the file_download function:

url <- paste("https://www.nice.org.uk/guidance/ta",TA_vector,"/documents/final-appraisal-determination-document", sep = "")
file_name <- paste("TA", TA_vector, "_FAD.pdf", sep = "")
Fulltext_FAD <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_FAD[Fulltext_FAD$Status == "NA","TA"]) 
## [1] 4
Fulltext_FAD
##    TA     Status
## 1 924         NA
## 2 877 downloaded
## 3 583 downloaded
## 4 572 downloaded
## 5 418 downloaded
## 6 390 downloaded
## 7 336         NA
## 8 315         NA
## 9 288         NA


The first bulk download attempt using the identified FAD URL pattern retrieved 5 out of 9 FAD reports, as indicated by “Status” in the FAD download tracking list. According to the tracking list, TA924, TA336, TA315, and TA288 could not be downloaded using this URL pattern. These may need to be downloaded manually or by finding alternative URL patterns using the method outlined in the Appendix.

For now, we can attempt to download subsequent (i.e., non-final) FAD documents by using the same method to identify URLs, experimenting with different variations. In some cases, a higher suffix (e.g., -2) may actually correspond to the final report (especially in TAs that went to appeal), which can be confirmed upon reviewing the file content. The goal is to download all available versions of the FAD documents wherever possible.

# url with FAD document 2
url <- paste("https://www.nice.org.uk/guidance/ta",TA_vector,"/documents/final-appraisal-determination-document-2", sep = "")
file_name <- paste("TA", TA_vector, "_FAD2.pdf", sep = "")
Fulltext_FAD2 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_FAD2[Fulltext_FAD2$Status == "downloaded", "TA"] 
## numeric(0)
Fulltext_FAD$Status2 <- Fulltext_FAD2$Status
Fulltext_FAD
##    TA     Status Status2
## 1 924         NA      NA
## 2 877 downloaded      NA
## 3 583 downloaded      NA
## 4 572 downloaded      NA
## 5 418 downloaded      NA
## 6 390 downloaded      NA
## 7 336         NA      NA
## 8 315         NA      NA
## 9 288         NA      NA


For FAD documents with a URL variation ending in -2, no additional downloads were successful, as indicated by “Status2” in the updated FAD download tracking list. This suggests that, for this set of TAs, only the unsuffixed FAD URL is valid. In a real review you could keep testing higher suffixes (e.g., -3); for this tutorial example, we stop here.
As for TA924, TA336, TA315, and TA288, no FAD documents could be downloaded using these URL patterns. They can either be downloaded manually or retrieved with a more exhaustive web-scraping approach that extracts all available URLs from the page source of the document-history page (as shown in the Appendix). This tutorial may be updated in the future with a detailed explanation of that code. Note that these documents are typically from TAs before approximately TA350, so the more complicated alternative approach is less likely to be needed when reviewing TAs numbered above 350.


6.3 Create files to store TA titles and track CP/FAD downloads

Create files to store the TA titles (scraped from the NICE website) and track the downloads for CP and FAD documents. These tracking lists can serve as a guide for identifying which TAs may need further manual downloads.

library(rvest)  # provides read_html(), html_nodes(), html_text()

# Create the URL of the document-history page for each TA
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/history", sep = "")

# Create a data frame for storage
title <- data.frame(TA = TA_vector,
                    title = rep(NA, times = length(TA_vector)))
for (n in 1:length(TA_vector)) {
    tryCatch({
    # Read the html
    webpage <- read_html(url[n])
    # Use CSS selectors to scrape the title section
    title_data_html <- html_nodes(webpage, '#content-start')
    title$title[n]  <- html_text(title_data_html)
    }, error = function(e){
    cat("ERROR :", conditionMessage(e), "\n")
    })
  
  # Random sleep time: the website may ban your IP if too many queries are sent too quickly
  sleepy <- sample(c(0.5:2.5), 1)
  cat("\n let's just wait for", sleepy, "seconds...")
  Sys.sleep(sleepy)
}

# Clean the scraped titles
title$title <- gsub("\\n", "", title$title)
title$title <- trimws(title$title)
## 
##  let's just wait for 0.5 seconds...
##  let's just wait for 2.5 seconds...
##  let's just wait for 2.5 seconds...
##  let's just wait for 2.5 seconds...
##  let's just wait for 2.5 seconds...
##  let's just wait for 2.5 seconds...
##  let's just wait for 2.5 seconds...
##  let's just wait for 0.5 seconds...
##  let's just wait for 2.5 seconds...
# Store title
write.csv(title, "Title.csv")

# Store FAD & CP download tracking list
write.csv(Fulltext_CP, "Fulltext_CP.csv")
write.csv(Fulltext_FAD, "Fulltext_FAD.csv")


Note: The waiting messages indicate pauses between each attempt to scrape the NICE website, as sending queries too quickly can result in a temporary IP ban from downloads.
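As a sketch, the pause logic can be factored into a small reusable helper (polite_sleep is a hypothetical name; the delay range is an assumption to tune against the site's tolerance):

```r
# Pause for a random interval between requests so queries are not sent too quickly
polite_sleep <- function(min_s = 0.5, max_s = 2.5) {
  s <- runif(1, min = min_s, max = max_s)  # draw a random delay in seconds
  Sys.sleep(s)
  invisible(s)                             # return the delay invisibly, e.g. for logging
}

# polite_sleep()  # call between successive read_html() requests
```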

Below is an example of how the file that records the title of each TA (Title.csv) appears:


Here’s a snapshot of the CP download tracking list (Fulltext_CP.csv):


And here is the FAD download tracking list (Fulltext_FAD.csv):


These lists help track which files were automatically downloaded and identify where manual downloads may be necessary.
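As a sketch of how these lists can be queried programmatically (needs_manual is a hypothetical helper, assuming tracking data frames shaped like Fulltext_CP and Fulltext_FAD above):

```r
# List the TAs for which every URL attempt failed (all Status columns are "NA"),
# i.e., those likely needing manual download
needs_manual <- function(tracking) {
  status_cols <- tracking[, grep("^Status", names(tracking)), drop = FALSE]
  tracking$TA[apply(status_cols, 1, function(x) all(x == "NA"))]
}

# needs_manual(Fulltext_CP)   # e.g., TAs with no CP downloaded at all
# needs_manual(Fulltext_FAD)
```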

Below is an example of the manually updated download tracking list (CP and FAD) from the aforementioned systematic review, following verification of the downloaded files. The red highlighted boxes indicate where manual downloads were necessary.



For alternative methods to automate downloads when none of the documents within a TA successfully download using the usual URL patterns (i.e., all documents for TA336, TA315, and TA288, and the FAD document for TA924), refer to the code in the Appendix. :)

Finally, here is a screenshot of all the files downloaded and saved during this tutorial.



7 Appendix: code for the 2020 systematic review of treatment sequences in health technology appraisals

The following is the original code from our 2020 systematic review of treatment sequences in health technology appraisals, which explored the prevalence, characteristics, and key issues of modelling treatment sequences within health economic evaluations.

The list of reviewed TAs required to initialise the code is available in the GitHub repository as TA_list_20191201.csv.

Disclaimer: Please note that the code was developed before the era of large language models (LLMs) and may not be highly efficient (and yes there might be typos! Please get in contact if you spot any, thanks!), but it served its purpose at the time! ;)

########################################################################
# Project: Review NICE TA regarding treatment sequencing problem
#          Automate the process of downloading TA documents
#          Manually downloading e.g. 60 files can take up to 2 hours 
#          and may lead to mistakes due to incorrect manual entries.
# Create: JY Amy Chang
# Date: 02Mar2020
########################################################################

# Download pdf files
install.packages("XML")
install.packages("bitops")
install.packages("RCurl")
install.packages("httr")
install.packages("xml2")
install.packages("rvest")
install.packages("stringr")
install.packages("truncnorm")

library(bitops)
library(RCurl)
library(XML)
library(httr)
library(xml2)
library(rvest)
library(stringr)
library(pagedown)
library(truncnorm)


# import active TA list for review
TA_list <- read.csv(file = "raw/TA_list_20191201.csv", header = FALSE)
TA_vector <- as.vector(TA_list[,"V1"])

# transform TA list into vectors for creating url and file name for bulk download
# sort vector to download from latest to oldest
TA_vector <- sort(TA_vector, decreasing = T)

# create a function for downloading files, return data frame error_vec that indicates which files are not downloaded
file_download <- function(TA_number, url, file_name) {
error_vec <- rep("NA", times = length(TA_number)) # 460
download_status <- data.frame (TA = TA_number,
                                 Status = error_vec)
download_status$Status <- as.character(download_status$Status)
for (n in 1:length(TA_number)) {
    tryCatch({
    download.file(url[n], destfile = file_name[n], mode="wb")
    download_status$Status[n] <- "downloaded"
    }, error = function(e){
         cat("ERROR :",conditionMessage(e), "\n")
         })
}
download_status[download_status$Status != "downloaded", "Status"] <- "NA"
return(download_status)
}

###########################################
# FAD (Final Appraisal Determination)
###########################################

# download FAD; not all files use the same URL logic
url <- paste("https://www.nice.org.uk/guidance/ta",TA_vector,"/documents/final-appraisal-determination-document", sep = "")
file_name <- paste("TA", TA_vector, "_FAD.pdf", sep = "")
Fulltext_FAD <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_FAD[Fulltext_FAD$Status == "NA","TA"]) # 259 undownloaded
write.csv(Fulltext_FAD, "Fulltext_FAD.csv")

# url with document 2
url <- paste("https://www.nice.org.uk/guidance/ta",TA_vector,"/documents/final-appraisal-determination-document-2", sep = "")
file_name <- paste("TA", TA_vector, "_FAD2.pdf", sep = "")
Fulltext_FAD2 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_FAD2[Fulltext_FAD2$Status == "downloaded", "TA"] # 42 downloaded
Fulltext_FAD$Status2 <- Fulltext_FAD2$Status

# test if there is document 1
# url <- paste("https://www.nice.org.uk/guidance/ta",TA_vector,"/documents/final-appraisal-determination-document-1", sep = "")
# file_name <- paste("TA", TA_vector, "_FAD1.pdf", sep = "")
# Fulltext_FAD1 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
# Fulltext_FAD1[Fulltext_FAD1$Status == "downloaded", "TA"] # 0 downloaded

# url with document 3
url <- paste("https://www.nice.org.uk/guidance/ta",TA_vector,"/documents/final-appraisal-determination-document-3", sep = "")
file_name <- paste("TA", TA_vector, "_FAD3.pdf", sep = "")
Fulltext_FAD3 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_FAD3[Fulltext_FAD3$Status == "downloaded", "TA"] # 9 downloaded
Fulltext_FAD$Status3 <- Fulltext_FAD3$Status

# url with document 4
url <- paste("https://www.nice.org.uk/guidance/ta",TA_vector,"/documents/final-appraisal-determination-document-4", sep = "")
file_name <- paste("TA", TA_vector, "_FAD4.pdf", sep = "")
Fulltext_FAD4 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_FAD4[Fulltext_FAD4$Status == "downloaded", "TA"] # 2 downloaded
Fulltext_FAD$Status4 <- Fulltext_FAD4$Status #[1] 491 487 these two has more documents due to CDF and managed access


#############################
# CP (Committee Paper)
#############################

# Committee papers are more complicated than FADs, as the first committee paper is usually the consultation document

# CP document 1
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers", sep = "")
file_name <- paste("TA", TA_vector, "_CP.pdf", sep = "")
Fulltext_CP <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_CP[Fulltext_CP$Status == "downloaded","TA"]) # 204 downloaded
#write.csv(Fulltext_CP, "Fulltext_CP.csv")

# CP document 2
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-2", sep = "")
file_name <- paste("TA", TA_vector, "_CP2.pdf", sep = "")
Fulltext_CP2 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_CP2[Fulltext_CP2$Status == "downloaded","TA"]) # 136 downloaded
Fulltext_CP$Status2 <- Fulltext_CP2$Status 

# CP document 3
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-3", sep = "")
file_name <- paste("TA", TA_vector, "_CP3.pdf", sep = "")
Fulltext_CP3 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_CP3[Fulltext_CP3$Status == "downloaded","TA"]) # 63 downloaded
Fulltext_CP$Status3 <- Fulltext_CP3$Status 

# CP document 4 (sometimes it can be just slides)
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-4", sep = "")
file_name <- paste("TA", TA_vector, "_CP4.pdf", sep = "")
Fulltext_CP4 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
length(Fulltext_CP4[Fulltext_CP4$Status == "downloaded","TA"]) # 32 downloaded
Fulltext_CP$Status4 <- Fulltext_CP4$Status 

# CP document 5 (can be managing access agreement)
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-5", sep = "")
file_name <- paste("TA", TA_vector, "_CP5.pdf", sep = "")
Fulltext_CP5 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_CP5[Fulltext_CP5$Status == "downloaded","TA"] # 17 downloaded
# [1] 588 541 510 502 495 491 484 483 479 474 473 472 445 432 423 417 402
Fulltext_CP$Status5 <- Fulltext_CP5$Status 

# CP document 6 (can be CDF glossary)
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-6", sep = "")
file_name <- paste("TA", TA_vector, "_CP6.pdf", sep = "")
Fulltext_CP6 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_CP6[Fulltext_CP6$Status == "downloaded","TA"] # 7 downloaded
# [1] 510 484 483 479 474 473 402
Fulltext_CP$Status6 <- Fulltext_CP6$Status

# CP document 7 (can be CDF glossary) (for those that had two consultations: seems to be the maximum)
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-7", sep = "")
file_name <- paste("TA", TA_vector, "_CP7.pdf", sep = "")
Fulltext_CP7 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_CP7[Fulltext_CP7$Status == "downloaded","TA"] # 3 downloaded
# [1] 484 474 473
Fulltext_CP$Status7 <- Fulltext_CP7$Status

# CP document 8 sorafenib, email
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-8", sep = "")
file_name <- paste("TA", TA_vector, "_CP8.pdf", sep = "")
Fulltext_CP8 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_CP8[Fulltext_CP8$Status == "downloaded","TA"] # 2 downloaded
# [1] 474 473 
Fulltext_CP$Status8 <- Fulltext_CP8$Status

# CP document 9 sorafenib, email
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-9", sep = "")
file_name <- paste("TA", TA_vector, "_CP9.pdf", sep = "")
Fulltext_CP9 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_CP9[Fulltext_CP9$Status == "downloaded","TA"] # 2 downloaded
# [1] 474 473
Fulltext_CP$Status9 <- Fulltext_CP9$Status

# CP document 10 sorafenib, email
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/documents/committee-papers-10", sep = "")
file_name <- paste("TA", TA_vector, "_CP10.pdf", sep = "")
Fulltext_CP10 <- file_download(TA_number = TA_vector, url = url, file_name = file_name)
Fulltext_CP10[Fulltext_CP10$Status == "downloaded","TA"] # 2 downloaded
# [1] 473
Fulltext_CP$Status10 <- Fulltext_CP10$Status

# There are further documents for TA473, up to document 12 (not downloaded due to irrelevance)

####################################################################
# FIND TERMINATED appraisals
####################################################################

# SOLVED WITH HELP FROM 
# TUTORIAL https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/
# https://selectorgadget.com/

# create url
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/history", sep = "")

# create dataframe for storage
title <- data.frame(TA = TA_vector,
                    title = rep(NA, times = length(TA_vector)))
for(n in 1:length(TA_vector)){
    tryCatch({
    # read html
    webpage <- read_html(url[n])
    # Using CSS selectors to scrape the rankings section
    title_data_html <- html_nodes(webpage,'#content-start')
    title$title[n]  <- html_text(title_data_html)
    }, error = function(e){
    cat("ERROR :",conditionMessage(e), "\n")
    })
  
  # random sleeping time
  sleepy = sample(c(0.5:2.5), 1)
  cat("\n let's just wait for",sleepy,"seconds...")
  Sys.sleep(sleepy) # website will ban IP when too many queries are sent too quickly...
}  

title[is.na(title$title), "TA"] # none (all titles were downloaded)

# update FAD & CP document

Fulltext_CP[grep("terminated", title$title), 2:11] <- "terminated"
write.csv(Fulltext_CP, "Fulltext_CP_20200305.csv")

Fulltext_FAD[grep("terminated", title$title), 2:5 ] <- "terminated"
write.csv(Fulltext_FAD, "Fulltext_FAD_20200302.csv")


##############################################
# Create a file to store TA titles
##############################################

# create url
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/history", sep = "")

# create dataframe for storage
documents <- data.frame(TA = TA_vector,
                    title = rep(NA, times = length(TA_vector)))
for(n in 1:length(TA_vector)){
  tryCatch({
    # read html
    webpage <- read_html(url[n])
    # Using CSS selectors to scrape the rankings section
    title_data_html <- html_nodes(webpage,'#content-start')
    documents$title[n] <- html_text(title_data_html)
  }, error = function(e){
    cat("ERROR :",conditionMessage(e), "\n")
  })
  
  # random sleeping time
  sleepy = sample(c(0.5:2.5), 1)
  cat("\n let's just wait for",sleepy,"seconds...")
  Sys.sleep(sleepy) # website will ban IP when too many queries are sent too quickly...
}  

write.csv(documents, "Title_20200302.csv")


##############################################
# download earlier FAD & CP documents
# (their URLs are less regularly named)
##############################################
# https://stackoverflow.com/questions/57008774/r-help-me-to-scrap-links-from-webpage
# https://i.stack.imgur.com/bISth.jpg
  
# the key point: every document link sits under the <ul class="media-list"> node
base <- 'https://www.nice.org.uk'
url  <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/history", sep = "")
# the commented-out version below is not used because it returns only relative links
# links <- read_html(url) %>% html_nodes(., ".media-list a") %>% html_attr(., "href") 
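# illustration: url_absolute() (from xml2) resolves a relative href against the
# base URL defined above; the path below is hypothetical
stopifnot(identical(url_absolute("/guidance/ta1/history", base),
                    "https://www.nice.org.uk/guidance/ta1/history"))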

# defaults to downloading committee papers; change the ".media-list a" selector inside the function if the documents are stored elsewhere
link_download <- function(TA_number = 1:length(TA_vector), url = url, search_term = "committee-papers") {
  # create vacant table for storing links
  links_df <- data.frame(TA = TA_vector,
                         link   = rep(NA, times = length(TA_vector)),
                         link2  = rep(NA, times = length(TA_vector)),
                         link3  = rep(NA, times = length(TA_vector)),
                         link4  = rep(NA, times = length(TA_vector)),
                         link5  = rep(NA, times = length(TA_vector)),
                         link6  = rep(NA, times = length(TA_vector)),
                         link7  = rep(NA, times = length(TA_vector)),
                         link8  = rep(NA, times = length(TA_vector)),
                         link9  = rep(NA, times = length(TA_vector)),
                         link10 = rep(NA, times = length(TA_vector)))
  
for(n in c(TA_number)){
    tryCatch({
      # collect absolute links from the history page, then keep those matching search_term
      links <- url_absolute(read_html(url[n]) %>% html_nodes(., ".media-list a") %>% html_attr(., "href"), base)
      links_df_temp <- links[grep(search_term, links)]
      links_df[n, 2:(length(links_df_temp)+1)] <- t(links_df_temp)
    }, error = function(e){
      cat("ERROR :",conditionMessage(e), "\n")
    })
    
    # random sleep between requests; NICE may block IPs that send too many queries too quickly
    sleepy <- rtruncnorm(n = 1, a = 0.000001, mean = 0.8, sd = 0.3)
    cat("\n let's just wait for", round(sleepy, 1), "seconds...")
    Sys.sleep(sleepy)
  }  
  return(links_df)
}
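# note on the delay: rtruncnorm() (truncnorm package) samples a normal
# distribution truncated at a, so the sleep time is always positive
# (illustrative check, independent of the scraping code)
stopifnot(all(rtruncnorm(n = 100, a = 0.000001, mean = 0.8, sd = 0.3) >= 0.000001))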

title <- read.csv(file = "output/Title_20200302.csv", header = TRUE)
title$X <- NULL # drop the row-number column written by write.csv
TA_vector_terminated <- grep("terminated", title$title) # indices of terminated TAs

# download FAD links (FAD links normally point to a pdf document; an html FAD overview page will not have "final-appraisal-determination" in its link)
FAD_links <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "final-appraisal-determination")
summary(as.factor(rowSums(!is.na(FAD_links[ , 2:22])))) # 61 TAs have no FAD links (fewer than the number of terminated TAs)
# 0   1   2   3   4   5   9  12  21 
# 61 184 185  11  11   4   2   1   1 
# keep at most 5 links per TA; manually download the TAs with more than 5 links or with incorrectly downloaded files
FAD_links[ , 7:22] <- NULL

# download function driven by a data frame of links
# note: relies on the 'download_status' matrix prepared in the global environment before each call
file_download2 <- function(TA_number = c(1:length(TA_vector)), url = FAD_links, file_name){
  for (n in c(TA_number)) {
  url_vector <- na.omit(unlist(url[n, 2:length(url[1,])]))
    if (length(url_vector) != 0) {
      for (k in 1:length(url_vector)) {
        tryCatch({
          download.file(url_vector[k], destfile = file_name[n, k], mode="wb")
          download_status[n, k + 1] <- "downloaded"
        }, error = function(e){
          cat("ERROR :",conditionMessage(e), "\n")})
      }
    }
  # random sleep between requests; NICE may block IPs that send too many queries too quickly
  sleepy <- round(rtruncnorm(n = 1, a = 0.000001, mean = 0.8, sd = 0.3), 1)
  cat("\n let's just wait for", sleepy, "seconds...")
  Sys.sleep(sleepy)
  }
  download_status <- data.frame(TA = download_status[ , 1],
                                Status = download_status[ , 2:length(url[1,])])
  return(download_status)
}

# find TAs whose FAD files still need downloading (no file recorded at all)
FAD <- read.csv(file = "output/Fulltext_FAD_20200302.csv", header = TRUE)
FAD$X <- NULL 

# initial FAD file download
# use the FAD tracking table as the base and add one more status column
FAD_temp <- FAD
FAD_temp$Status5 <- NA
download_status <- as.matrix(FAD_temp)

file_name       <- matrix (c(paste("TA", TA_vector, "_FAD1.pdf", sep = ""),
                             paste("TA", TA_vector, "_FAD2.pdf", sep = ""),
                             paste("TA", TA_vector, "_FAD3.pdf", sep = ""),
                             paste("TA", TA_vector, "_FAD4.pdf", sep = ""),
                             paste("TA", TA_vector, "_FAD5.pdf", sep = "")), 
                             ncol = 5, byrow = F)


# Download FAD file with links # starting from TA404 there is no FAD (index 204)
TA_vector_index <- 1:length(TA_vector)
Fulltext_FAD_amend <- file_download2(TA_number = TA_vector_index[rowSums(!is.na(FAD[ , 2:5])) == 0], 
                                      url = FAD_links, file_name)

        #######################################################
        # create exclusion dataframe 
        # (based on fetch FAD link result)
        #######################################################
        
        # list TA numbers that have no FAD link and are not terminated
        `%notin%` <- Negate(`%in%`)
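        # quick check of the helper: %notin% is TRUE only when the value is absent
        stopifnot(5 %notin% c(1, 2, 3), !(2 %notin% c(1, 2, 3)))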
        No_FAD_link <- FAD_links$TA[rowSums(!is.na(FAD_links[ , 2:22])) == 0] 
        No_FAD_link[No_FAD_link %notin% TA_vector[TA_vector_terminated]]
        
        length(FAD_links$TA[rowSums(!is.na(FAD_links[ , 2:11])) == 0]) #18
        # [1] *532(withdrawn no longer on market) 
        #     *493(replaced by NG616) 
        #     *459(withdrawn) 
        #     404(outlier "fad-document", suggest manual download) 
        #     *394(review: no FAD, no ERG) 
        #     *381(replaced by TA620 in Jan2020) 
        #     366(wrongly named, "appraisal-consultation-document", suggest manual download) 
        #     292(wrongly named, "final-appraisal-determintation-document2", manual download)
        #     266(wrongly named, "final-appraisal-determinaton") 
        #     264(wrongly named, "final-appriasal-determination", manual download)  
        #     55(no FAD, but assessment report)  
        #     38(no FAD, but HTA report)  
        #     34(no FAD, but Assessment report)  
        #     29(no FAD, but Assessment report)  
        #     23(no FAD, but Assessment report)  
        #     20(no FAD, but Assessment report) 
        #     10(no FAD, but Assessment report)
        #     *1(review: no FAD, no ERG)
        
        
        TA_withdrawn  <- c(532, 459)
        TA_replaced   <- c(493, 381)
        TA_noCSFADERG <- c(394, 1)
        #43 (43 terminated + 6)
        
test <- Fulltext_FAD_amend
test[2:6] <- sapply(test[2:6], as.character)
test[TA_vector_terminated, 2:6] <- "terminated"
test[TA_vector %in% TA_withdrawn, 2:6] <- "withdrawn"
test[TA_vector %in% TA_replaced, 2:6] <- "replaced"
test[TA_vector %in% TA_noCSFADERG, 2:6] <- "no_CS_FAD_ERG"
Fulltext_FAD_amend <- test

Fulltext_FAD_amend$TA[rowSums(!is.na(Fulltext_FAD_amend[ , 2:6])) == 0] #12
#  [1] 404 366 292 266 264  55  38  34  29  23  20  10 (TAs without FAD) see explanation above

# output UPDATED FAD
write.csv(Fulltext_FAD_amend, "Fulltext_FAD_20200305.csv")
write.csv(FAD_links, "Links_FAD_20200305.csv")


### download CP links (some CP links point to an html page rather than a pdf)
CP_links  <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "committee-papers")
TA_vector[rowSums(!is.na(CP_links[ , 12:13])) != 0] # TA473 will have some undownloaded documents (too many links to store)

# e.g. TA209: evaluation report (the committee papers appear under a different name)
ER_links     <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "evaluation-report")
rowSums(!is.na(ER_links[ , 2:10]))
# e.g. TA 192
ERGR_links   <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "erg-report")
rowSums(!is.na(ERGR_links[ , 2:10])) 
# e.g. TA 123
ERGR_links2  <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "evidence-review-group-report")
rowSums(!is.na(ERGR_links2[ , 2:10])) 
# TA191
ERGR_links3  <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "evidence-review-groups-report")
rowSums(!is.na(ERGR_links3[ , 2:10])) 
# e.g. TA38
HTA_links    <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "hta-report")
rowSums(!is.na(HTA_links[ , 2:10])) 
# e.g. TA75
HTA_links2   <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "health-technology-assessment")
rowSums(!is.na(HTA_links2[ , 2:10])) 
# e.g. TA278, TA61, TA 59
AR_links     <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "assessment-report")
rowSums(!is.na(AR_links[ , 2:10]))
# TA195 really long
# e.g. TA188
AR_links2    <- link_download(TA_number = 1:length(TA_vector), url = url, search_term = "assessment-group-report")
rowSums(!is.na(AR_links2[ , 2:10])) 


# check that the number of links (including CP) won't exceed the 8 download spaces (the CP_amend data frame can be used to update)
summary(as.factor(rowSums(!is.na(CP_links[ , 2:11]))     +
                  rowSums(!is.na(ER_links[ , 2:11]))     +
                  rowSums(!is.na(ERGR_links[ , 2:11]))   +
                  rowSums(!is.na(ERGR_links2[ , 2:11]))  +
                  rowSums(!is.na(ERGR_links3[ , 2:11]))  +
                  rowSums(!is.na(HTA_links[ , 2:11]))    +
                  rowSums(!is.na(HTA_links2[ , 2:11]))   +
                  rowSums(!is.na(AR_links[ , 2:11]))     +  
                  rowSums(!is.na(AR_links2[ , 2:11]))))
# 0   1   2   3   4   5   6   7   8   9  10  11  13 
# 50  98 139  67  44  21  15  12   5   3   4   1   1 
# links can be stored together

summary(as.factor(  rowSums(!is.na(ER_links[ , 2:10]))     +
                    rowSums(!is.na(ERGR_links[ , 2:10]))   +
                    rowSums(!is.na(ERGR_links2[ , 2:10]))  +
                    rowSums(!is.na(ERGR_links3[ , 2:10]))  +
                    rowSums(!is.na(HTA_links[ , 2:10]))    +
                    rowSums(!is.na(HTA_links2[ , 2:10]))   +
                    rowSums(!is.na(AR_links[ , 2:10]))     +  
                    rowSums(!is.na(AR_links2[ , 2:10]))))
#   0   1   2   3   4   5   6   7   8   9  10 
# 252  58  65  27  24   9   9   7   4   4   1 

# TAs with no CP-type links at all (n = 50)
no_CP <- TA_vector_index [(as.factor(rowSums(!is.na(CP_links[ , 2:11]))     +
            rowSums(!is.na(ER_links[ , 2:11]))     +
            rowSums(!is.na(ERGR_links[ , 2:11]))   +
            rowSums(!is.na(ERGR_links2[ , 2:11]))  +
            rowSums(!is.na(ERGR_links3[ , 2:11]))  +
            rowSums(!is.na(HTA_links[ , 2:11]))    +
            rowSums(!is.na(HTA_links2[ , 2:11]))   +
            rowSums(!is.na(AR_links[ , 2:11]))     +  
            rowSums(!is.na(AR_links2[ , 2:11])))) == 0]
length(no_CP)
no_CP
no_CP <- no_CP[no_CP %notin% TA_vector_terminated] # drop terminated TAs
no_CP <- no_CP[no_CP %notin% TA_vector_index[TA_vector %in% TA_noCSFADERG]] # drop TAs with no CS/FAD/ERG
no_CP <- no_CP[no_CP %notin% TA_vector_index[TA_vector %in% TA_replaced]] # drop replaced TAs
no_CP <- no_CP[no_CP %notin% TA_vector_index[TA_vector %in% TA_withdrawn]] # drop withdrawn TAs # 65
no_CP
# [1] 440 447
TA_vector[no_CP] 
# [1] 77 (protocol-newer-hypnotic-drugs-for-shortterm-pharmacotherapy-for-insomnia2)  
#     64 (report-by-a-consortium)



# find TAs whose CP files still need downloading (no file recorded at all)
CP <- read.csv(file = "output/Fulltext_CP_20200305.csv", header = TRUE)
CP$X <- NULL 
CP[TA_vector_terminated , 2:11] <- NA

summary(as.factor(rowSums(!is.na(CP_links[ , 2:13])))) # 229 TAs have no links from the original CP download (level 0 below)
# 0   1   2   3   4   5   6   7   9  12 
# 229  60  86  41  26   9   4   3   1   1 
rowSums(!is.na(CP_links[ , 2:11]))
# keep at most 10 links per TA; manually download the TAs with more links or with incorrectly downloaded files
CP_links[ , 12:13] <- NULL # drop the extra columns; manually download the TAs that had 9 or 12 links

# initial CP file download
# use the CP tracking table as the base
CP_temp <- CP
download_status <- as.matrix(CP_temp)

file_name       <- matrix (c(paste("TA", TA_vector, "_CP1.pdf", sep = ""),
                             paste("TA", TA_vector, "_CP2.pdf", sep = ""),
                             paste("TA", TA_vector, "_CP3.pdf", sep = ""),
                             paste("TA", TA_vector, "_CP4.pdf", sep = ""),
                             paste("TA", TA_vector, "_CP5.pdf", sep = ""), 
                             paste("TA", TA_vector, "_CP6.pdf", sep = ""),
                             paste("TA", TA_vector, "_CP7.pdf", sep = ""),
                             paste("TA", TA_vector, "_CP8.pdf", sep = ""),
                             paste("TA", TA_vector, "_CP9.pdf", sep = ""),
                             paste("TA", TA_vector, "_CP10.pdf", sep = "")),
                             ncol = 10, byrow = F)


# Download CP file with links 
TA_vector_index <- 1:length(TA_vector)
length(TA_vector_index[rowSums(!is.na(CP[ , 2:11])) == 0]) # 247 TAs need downloading via links (including terminated)
Fulltext_CP_amend <- file_download2(TA_number = TA_vector_index[rowSums(!is.na(CP[ , 2:11])) == 0], 
                                   url = CP_links, file_name)
length(TA_vector_index[rowSums(!is.na(Fulltext_CP_amend[ , 2:11])) == 0]) # 225 TAs still have no files after the download
TA_vector_terminated <- grep("terminated", title$title)
test <- Fulltext_CP_amend
test[2:11] <- sapply(test[2:11], as.character)
test[TA_vector_terminated, 2:11] <- "terminated"
test[TA_vector %in% TA_withdrawn, 2:11] <- "withdrawn"
test[TA_vector %in% TA_replaced, 2:11] <- "replaced"
test[TA_vector %in% TA_noCSFADERG, 2:11] <- "no_CS_FAD_ERG"
Fulltext_CP_amend <- test

length(TA_vector_index[rowSums(!is.na(Fulltext_CP_amend[ , 2:11])) == 0]) # 180 TAs still have no files after labelling terminated/withdrawn/replaced TAs

# output UPDATED CP
write.csv(Fulltext_CP_amend, "Fulltext_CP_20200305.csv")
write.csv(CP_links, "Links_CP_20200305.csv")

#####################################################################
# download the remaining ERG-type reports
#####################################################################
# create url matrix for download
url_ERG <- matrix (rep(NA, 11*length(TA_vector)), byrow = T, ncol = 11)
url_ERG[ , 1] <- TA_vector
for (n in c(TA_vector_index)){
  x <- c(unname(unlist(ER_links[n, 2:11])[!is.na(unlist(ER_links[n, 2:11]))]),
         unname(unlist(ERGR_links[n, 2:11])[!is.na(unlist(ERGR_links[n, 2:11]))]),
         unname(unlist(ERGR_links2[n, 2:11])[!is.na(unlist(ERGR_links2[n, 2:11]))]),
         unname(unlist(ERGR_links3[n, 2:11])[!is.na(unlist(ERGR_links3[n, 2:11]))]),
         unname(unlist(HTA_links[n, 2:11])[!is.na(unlist(HTA_links[n, 2:11]))]),
         unname(unlist(HTA_links2[n, 2:11])[!is.na(unlist(HTA_links2[n, 2:11]))]),
         unname(unlist(AR_links[n, 2:11])[!is.na(unlist(AR_links[n, 2:11]))]),
         unname(unlist(AR_links2[n, 2:11])[!is.na(unlist(AR_links2[n, 2:11]))])
  )
  x <- x[0:min(10, length(x))] # the zero index is silently dropped, so this keeps at most the first 10 links
  
  if (length(x) != 0){
    url_ERG[n, 2:(length(x)+1)] <- x
  }
}

url_ERG <- as.data.frame(url_ERG)

file_name       <- matrix (c(paste("TA", TA_vector, "_ERG1.pdf", sep = ""),
                             paste("TA", TA_vector, "_ERG2.pdf", sep = ""),
                             paste("TA", TA_vector, "_ERG3.pdf", sep = ""),
                             paste("TA", TA_vector, "_ERG4.pdf", sep = ""),
                             paste("TA", TA_vector, "_ERG5.pdf", sep = ""),
                             paste("TA", TA_vector, "_ERG6.pdf", sep = ""),
                             paste("TA", TA_vector, "_ERG7.pdf", sep = ""),
                             paste("TA", TA_vector, "_ERG8.pdf", sep = ""),
                             paste("TA", TA_vector, "_ERG9.pdf", sep = ""),
                             paste("TA", TA_vector, "_ERG10.pdf", sep = "")
                             ), ncol = 10, byrow = F)
# Download ERG file with links 
download_status[ , 2:11] <- NA # reset the global status matrix reused from the CP download
url_ERG[2:11] <- sapply(url_ERG[2:11], as.character)
Fulltext_ERG <- file_download2(TA_number = TA_vector_index, 
                               url = url_ERG, file_name)



length(TA_vector_index[rowSums(!is.na(Fulltext_ERG[ , 2:11])) == 0]) # 252 TAs still have no ERG-type files
TA_vector_terminated <- grep("terminated", title$title)
test <- Fulltext_ERG
test[2:11] <- sapply(test[2:11], as.character)
test[TA_vector_terminated, 2:11] <- "terminated"
test[TA_vector %in% TA_withdrawn, 2:11] <- "withdrawn"
test[TA_vector %in% TA_replaced, 2:11] <- "replaced"
test[TA_vector %in% TA_noCSFADERG, 2:11] <- "no_CS_FAD_ERG"
Fulltext_ERG <- test

# output UPDATED ERG
write.csv(Fulltext_ERG, "Fulltext_ERG_20200305.csv")
write.csv(url_ERG, "Links_ERG_20200305.csv")



###################################
# Download TA history file
# convert html to pdf
###################################
# ref: https://rdrr.io/cran/pagedown/man/chrome_print.html
# install.packages("pagedown")

# create url
url <- paste("https://www.nice.org.uk/guidance/ta", TA_vector, "/history", sep = "")
file_name <- paste("TA", TA_vector, "_history.pdf", sep = "")
history <- data.frame (TA = TA_vector,
                       Status = rep(NA, length(TA_vector)))
indices <- 1:length(TA_vector)

download_history <- function(url, file_name, indices){
  for(n in c(indices)){
    tryCatch({
      # read html
      chrome_print(url[n], output = file_name[n])
      history$Status[n]  <- "downloaded"
    }, error = function(e){
      cat("ERROR :",conditionMessage(e), "\n")
    })
    # random sleep between requests; NICE may block IPs that send too many queries too quickly
    sleepy <- sample(c(0.5, 1.5, 2.5), 1)
    cat("\n let's just wait for", sleepy, "seconds...")
    Sys.sleep(sleepy)
  }
return(history)
}

# download all history pages; capture the returned status table
# (the original code discarded the return value, leaving 'history' unchanged)
history <- download_history(url = url, file_name = file_name, indices = indices)
history[is.na(history$Status), "Status"] <- "NA"
history[history$Status == "NA", "TA"]
# [1] 556 550 547 507 435 434 431 362 359 353 351 350 169 167 161  34  20  10   1

# output the indices where files were not downloaded and try again
indices <- which(history$Status == "NA")
history <- download_history(url = url, file_name = file_name, indices = indices)
history[history$Status == "NA", "TA"]
# [1] 556 434 431 167 161
# manually download these files, then mark them as done
history[history$Status == "NA", "Status"] <- "downloaded"

# save file
write.csv(history, "History_20200303.csv")